{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "d66ce00f",
   "metadata": {},
   "source": [
    "<h1> Reviews Are Gold!? On the Link between Item Reviews and Item Preferences</h1>\n",
    "\n",
    "<i>[KARS & ComplexRec Joint Workshop at ACM RecSys 2021](https://kars-workshop.github.io/2021/)</i>\n",
    "\n",
    "The paper features experiments on subsamples of the [5-core Amazon Reviews Dataset (2014)](https://nijianmo.github.io/amazon/index.html). We essentially compare the proposed review-based similarity measure with the cosine similarity measure with respect to a user-based collaborative filtering task.\n",
    "\n",
    "<ol>\n",
    "  <li>Data Set Download</li>  \n",
    "  <li>Sample datasets Head-100, Median-100, Tail-100, and Mix-100</li>  \n",
    "</ol>\n",
    "\n",
    "The review-based similarity can be calculated with the help of the [WMDtestbed](https://github.com/TEichinger/WMDtestbed)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f9025d72",
   "metadata": {},
   "source": [
    "<h2> 1. Data Set Download</h2>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "f42cd99d",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import gzip, json\n",
    "import pandas as pd\n",
    "\n",
    "# loading script by https://nijianmo.github.io/amazon/index.html#subsets\n",
    "       \n",
    "def parse_r(path):\n",
    "    g = gzip.open(path, 'rb')\n",
    "    for l in g:\n",
    "        yield json.loads(l)\n",
    "\n",
    "def getDF(path, n_rows = 100):\n",
    "    i = 0\n",
    "    df = {}\n",
    "    for d in parse_r(path):\n",
    "        df[i] = d\n",
    "        i += 1\n",
    "        \n",
    "        if n_rows is not None:\n",
    "            if i > n_rows:\n",
    "                break\n",
    "    return pd.DataFrame.from_dict(df, orient='index')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "3f39636e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# LOAD kcore_5 DATA\n",
    "#############################\n",
    "path = './data/kcore_5.json.gz'\n",
    "df = getDF(path, n_rows = None)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f5068074",
   "metadata": {},
   "source": [
    "<h1> 2. Sample datasets Head-100, Median-100, Tail-100, and Mix-100 </h1>\n",
    "\n",
    "We sample 4 data sets from the Amazon Reviews 5-core dataset (2014). Every data set holds data on 100 users. Head-100, the top 100 reviewers with entries on the most items, that is rating, review, or both. Median-100, the median 100 users. Tail-100, the tail 100 users. And last but not least Mix-100, a mixture of all prior three datasets at a ratio of 33:34:33.\n",
    "\n",
    "Training-test splits are performed at a ratio of 80-20, where sampling is performed per-user, that is every user has about 80% training and 20% test entries.\n",
    "\n",
    "The below code exports the samples in the traditional (userId, movieId, rating, timestamp) .csv format, and the reviews as a concatenation on blanks (' ') of reviews of a user. The resulting text files are of the format \"<userId>.txt\", where <userId> some user identifier in the Amazon Reviews 5-core data set (2014).\n"
   ]
  },
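  {
   "cell_type": "markdown",
   "id": "a7c1e0b2",
   "metadata": {},
   "source": [
    "The sampling scheme described above can be sketched on toy data. This is a minimal illustration and not the code used for the paper: the toy users `u0` to `u9`, the sample size `k = 2` (instead of 100), and all `toy_*` names are assumptions made for this example, and the toy mix draws one user per sample instead of using the 33:34:33 ratio."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b8d2f1c3",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# toy data (hypothetical): user u0 has 10 entries, u1 has 9, ..., u9 has 1\n",
    "toy_df = pd.DataFrame({'reviewerID': ['u{}'.format(u) for u in range(10) for _ in range(10 - u)]})\n",
    "\n",
    "# number of entries per user, indexed by reviewerID and sorted in descending order\n",
    "toy_counts = toy_df['reviewerID'].value_counts()\n",
    "\n",
    "k = 2  # users per sample (100 in the notebook)\n",
    "toy_head = toy_counts.iloc[:k]\n",
    "mid = len(toy_counts) // 2\n",
    "toy_median = toy_counts.iloc[mid - k // 2: mid + k // 2]\n",
    "toy_tail = toy_counts.iloc[-k:]\n",
    "\n",
    "# toy mix: one user drawn from each of the three samples\n",
    "toy_mix = pd.concat([s.sample(1, random_state = 12345) for s in (toy_head, toy_median, toy_tail)])\n",
    "\n",
    "print(list(toy_head.index), list(toy_median.index), list(toy_tail.index))\n",
    "print(sorted(toy_mix.index))"
   ]
  },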
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "f38dbff7",
   "metadata": {},
   "outputs": [],
   "source": [
    "def export_sample(rating_df, sample_name, output_dir = './data/5core_5/samples', traintest_seed = None):\n",
    "    \"\"\" EXPORT RATINGS\n",
    "         the format of user-item matrix is \n",
    "         userId, movieId, rating, timestamp \n",
    "          1, 2, 4.5, 1\n",
    "         ...\n",
    "    \"\"\"\n",
    "    # define output_path\n",
    "    output_dir = os.path.join(output_dir, sample_name)\n",
    "    file_name = sample_name \n",
    "    if traintest_seed is not None:\n",
    "        file_name += '_ttseed={}'.format(traintest_seed) \n",
    "    file_name += '_ratings.csv'\n",
    "    sample_ratings_output_path = os.path.join(output_dir, file_name) \n",
    "\n",
    "    # rename columns\n",
    "    rating_df = rating_df.rename(columns={'reviewerID':'userId', 'asin':'itemId', 'overall':'rating',\\\n",
    "                                         'unixReviewTime':'timestamp'})\n",
    "    # drop unnecessary columns\n",
    "    rating_df = rating_df.drop(['reviewerName', 'helpful', 'reviewText', 'summary', 'reviewTime'], axis=1)\n",
    "    \n",
    "    # make a directories, if none exists already\n",
    "    if not os.path.isdir(output_dir):\n",
    "        os.makedirs(output_dir)\n",
    "\n",
    "    # ratings:\n",
    "    # \n",
    "    # \t\t\t\tuserId \t\titemId \trating \ttimestamp\n",
    "    # 0 \tA23PISU0ZLW71C \t0000029831 \t5.0 \t1393200000\n",
    "    # 1 \tAAE8WBNKHQPL5 \t0000031887 \t5.0 \t1354752000\n",
    "    # 2 \tA2T247H3WD9NS0 \t0000031887 \t5.0 \t1391040000\n",
    "    # 3 \tA1YJJG9T2P6VNS \t0000031887 \t5.0 \t1388620800\n",
    "    # 4 \tA2B62F0GQMUVY0 \t0000031887 \t2.0 \t1352764800\n",
    "    # 5 \tA12OFS8WQP86O5 \t0000031887 \t5.0 \t1297123200\n",
    "    # 6 \tA2UWVZDUHXTBPI \t0000031887 \t5.0 \t1360886400\n",
    "    # 7 \tAEXHAUK8HZPV2 \t0000031887 \t4.0 \t1361145600\n",
    "    # 8 \tA2M2APVYIB2U6K \t0000031887 \t5.0 \t1356220800\n",
    "    # 9 \tA3EERSWHAI6SO \t0000031887 \t5.0 \t1349568000\n",
    "    #\n",
    "    # write to csv file\n",
    "    rating_df.to_csv(sample_ratings_output_path, index = False)\n",
    "    print(\"Exported sample ratings to {}.\".format(sample_ratings_output_path))\n",
    "    return\n",
    "\n",
    "\n",
    "def export_reviews(ratings_df, sample_name, traintest_seed, output_dir = '../BA-llawrenz/repo/data/experiments/amazon_experiments/'):\n",
    "    \"\"\" EXPORT REVIEWS per 'reviewerID' in a distinct <reviewerID>.txt file. The reviews are string-concatenated with \" \"\n",
    "    \"\"\"\n",
    "    output_dir = os.path.join(output_dir, sample_name + '_ttseed={}'.format(traintest_seed))\n",
    "    # make a directories, if none exists already\n",
    "    if not os.path.isdir(output_dir):\n",
    "        os.makedirs(output_dir)\n",
    "\n",
    "    # make a copy of the ratings_df; we make a deep copy such that we do not alter the original dataframe\n",
    "    # the copy will be garbadge collected by Python once the process executing this function is terminated\n",
    "    df_copy = ratings_df.loc[:,[\"reviewerID\"]].copy()\n",
    "    # concatenate all reviews\n",
    "    df_copy['concat_reviews'] = ratings_df.groupby([\"reviewerID\"])[\"reviewText\"].transform(lambda x : ' '.join(x))\n",
    "    # drop duplicates (as every row holds a concatenation of reviews)\n",
    "    df_copy = df_copy.drop_duplicates()\n",
    "    \n",
    "    # export .csv files for every user\n",
    "    users = ratings_df.loc[:,'reviewerID'].drop_duplicates().tolist()\n",
    "    \n",
    "    \n",
    "    for user in users:\n",
    "        user_reviews = df_copy.loc[df_copy.loc[:,'reviewerID'] == user].loc[:,'concat_reviews'].tolist()\n",
    "        user_reviews = user_reviews[0]\n",
    "        with open(os.path.join(output_dir, user+'.txt'), mode = \"w\") as f:\n",
    "            f.write(user_reviews)\n",
    "                  \n",
    "            \n",
    "    print(\"Exported sample reviews to {}.\".format(output_dir))\n",
    "    return\n",
    "\n",
    "\n",
    "def make_train_test_split_per_user(rating_df, train_frac = 0.8, traintest_seed = 1, username_col_name = \"reviewerID\"):\n",
    "    \"\"\"\n",
    "    Input:\n",
    "        - rating_df: pandas.DataFrame im coordinate format\n",
    "\n",
    "    Output:\n",
    "        - train_df, test_df\n",
    "    \"\"\"\n",
    "    train_df = rating_df.groupby(username_col_name).sample(frac=train_frac, random_state = traintest_seed)\n",
    "    test_df_indices =  set(rating_df.index).difference(set(train_df.index))\n",
    "    test_df = rating_df.loc[test_df_indices,:]\n",
    "    \n",
    "    return train_df, test_df "
   ]
  },
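  {
   "cell_type": "markdown",
   "id": "c9e3a2d4",
   "metadata": {},
   "source": [
    "The per-user split can be checked on toy data. The cell below is a sketch under stated assumptions: the two users `u1` and `u2` with 10 entries each and all `toy_*` names are hypothetical, and the seed 426 is only one example value. It repeats the `groupby(...).sample(...)` logic inline so that it runs on its own, and it needs a pandas version that provides `groupby(...).sample` (1.1 or later)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d0f4b3e5",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# toy data (hypothetical): two users with 10 entries each\n",
    "toy_ratings = pd.DataFrame({'reviewerID': ['u1'] * 10 + ['u2'] * 10,\n",
    "                            'asin': ['i{}'.format(i) for i in range(20)],\n",
    "                            'overall': [5.0] * 20})\n",
    "\n",
    "# 80% of the entries of every user go to the training set\n",
    "toy_train = toy_ratings.groupby('reviewerID').sample(frac = 0.8, random_state = 426)\n",
    "# the remaining entries form the test set\n",
    "toy_test = toy_ratings.drop(toy_train.index)\n",
    "\n",
    "train_sizes = toy_train.groupby('reviewerID').size().to_dict()\n",
    "test_sizes = toy_test.groupby('reviewerID').size().to_dict()\n",
    "print(train_sizes, test_sizes)"
   ]
  },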
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "9af488cc",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Exported sample ratings to ./data/kcore_5/samples/kcore5_head_100/kcore5_head_100_ttseed=no-split_ratings.csv.\n",
      "Exported sample ratings to ./data/kcore_5/samples/kcore5_median_100/kcore5_median_100_ttseed=no-split_ratings.csv.\n",
      "Exported sample ratings to ./data/kcore_5/samples/kcore5_tail_100/kcore5_tail_100_ttseed=no-split_ratings.csv.\n",
      "Exported sample ratings to ./data/kcore_5/samples/kcore5_mix_100/kcore5_mix_100_ttseed=no-split_ratings.csv.\n",
      "Exported sample reviews to ../BA-llawrenz/repo/data/experiments/amazon_experiments/kcore5_head_100_train_ttseed=no-split.\n",
      "Exported sample reviews to ../BA-llawrenz/repo/data/experiments/amazon_experiments/kcore5_median_100_train_ttseed=no-split.\n",
      "Exported sample reviews to ../BA-llawrenz/repo/data/experiments/amazon_experiments/kcore5_tail_100_train_ttseed=no-split.\n",
      "Exported sample reviews to ../BA-llawrenz/repo/data/experiments/amazon_experiments/kcore5_mix_100_train_ttseed=no-split.\n"
     ]
    }
   ],
   "source": [
    "# Remark: Uncomment below code-lines to generate train-test splits of the Head-100, Median-100, \n",
    "#         Tail-100, and Mix-100 data sets \n",
    "\n",
    "# Set parameters\n",
    "#################\n",
    "# sample seed to sample users in Mix-100 from Head-100, Median-100, and Tail-100 at a ratio of 33:34:33\n",
    "sample_seed = 12345  # fixed\n",
    "# seed for training/test splits\n",
    "#traintest_seed = 426 # 5 distinct traintest_seeds: [426, 72620, 17298, 55142, 8614]\n",
    "# fraction of training entries\n",
    "#train_frac = 0.8     # fixed\n",
    "###############################################\n",
    "\n",
    "# 1. Sample user_ids for Head-100, Median-100, Tail-100, and Mix-100\n",
    "#######################################################################\n",
    "users_head = count_df[:100]\n",
    "# middle_index: index between 0 and len(count_df)\n",
    "median_index = (len(count_df)//2)\n",
    "users_median = count_df[median_index-50: median_index+50]\n",
    "users_tail = count_df[-100:]\n",
    "# create a mix of the high, median, and low datasets\n",
    "users_mix = pd.concat([users_head.sample(33, random_state= sample_seed),\\\n",
    "                       users_median.sample(34, random_state= sample_seed),\\\n",
    "                       users_tail.sample(33, random_state= sample_seed)], axis = 0)\n",
    "\n",
    "# 2. Select all entries by users in Head-100, Median-100, Tail-100, and Mix-100 \n",
    "################################################################################\n",
    "df_head_100 = df.loc[df.loc[:,'reviewerID'].isin(users_head.index)]\n",
    "df_median_100 = df.loc[df.loc[:,'reviewerID'].isin(users_median.index)]\n",
    "df_tail_100 = df.loc[df.loc[:,'reviewerID'].isin(users_tail.index)]\n",
    "df_mix_100 = df.loc[df.loc[:,'reviewerID'].isin(users_mix.index)]\n",
    "\n",
    "# 3. Split train and test dfs \n",
    "#################################\n",
    "\n",
    "#df_head_100_train_df  , df_head_100_test_df  = make_train_test_split_per_user(df_head_100  , train_frac = train_frac, traintest_seed = traintest_seed)\n",
    "#df_median_100_train_df, df_median_100_test_df= make_train_test_split_per_user(df_median_100, train_frac = train_frac, traintest_seed = traintest_seed)\n",
    "#df_tail_100_train_df  , df_tail_100_test_df  = make_train_test_split_per_user(df_tail_100  , train_frac = train_frac, traintest_seed = traintest_seed)\n",
    "#df_mix_100_train_df   , df_mix_100_test_df   = make_train_test_split_per_user(df_mix_100   , train_frac = train_frac, traintest_seed = traintest_seed)\n",
    "\n",
    "\n",
    "# 4.1 Export ratings.csv file\n",
    "#################################\n",
    "# export all splits\n",
    "export_sample(df_head_100, \"kcore5_head_100\", output_dir = './data/kcore_5/samples', traintest_seed = \"no-split\")\n",
    "export_sample(df_median_100, \"kcore5_median_100\", output_dir = './data/kcore_5/samples', traintest_seed = \"no-split\")\n",
    "export_sample(df_tail_100, \"kcore5_tail_100\", output_dir = './data/kcore_5/samples', traintest_seed = \"no-split\")\n",
    "export_sample(df_mix_100, \"kcore5_mix_100\", output_dir = './data/kcore_5/samples', traintest_seed = \"no-split\")\n",
    "# export trains-splits\n",
    "#export_sample(df_head_100_train_df, \"kcore5_head_100_train\", output_dir = './data/kcore_5/samples', traintest_seed = traintest_seed)\n",
    "#export_sample(df_median_100_train_df, \"kcore5_median_100_train\", output_dir = './data/kcore_5/samples', traintest_seed = traintest_seed)\n",
    "#export_sample(df_tail_100_train_df, \"kcore5_tail_100_train\", output_dir = './data/kcore_5/samples', traintest_seed = traintest_seed)\n",
    "#export_sample(df_mix_100_train_df, \"kcore5_mix_100_train\", output_dir = './data/kcore_5/samples', traintest_seed = traintest_seed)\n",
    "\n",
    "# 4.2 Export review.txt files\n",
    "#################################\n",
    "# export all concatenated reviews\n",
    "export_reviews(  df_head_100, 'kcore5_head_100_train', \"no-split\", output_dir = '../BA-llawrenz/repo/data/experiments/amazon_experiments/')\n",
    "export_reviews(df_median_100, 'kcore5_median_100_train', \"no-split\", output_dir = '../BA-llawrenz/repo/data/experiments/amazon_experiments/')\n",
    "export_reviews(   df_tail_100, 'kcore5_tail_100_train', \"no-split\", output_dir = '../BA-llawrenz/repo/data/experiments/amazon_experiments/')\n",
    "export_reviews(   df_mix_100, 'kcore5_mix_100_train', \"no-split\", output_dir = '../BA-llawrenz/repo/data/experiments/amazon_experiments/')\n",
    "\n",
    "# export concatenated training reviews\n",
    "#export_reviews(  df_head_100_train_df, 'kcore5_head_100_train', traintest_seed, output_dir = '../BA-llawrenz/repo/data/experiments/amazon_experiments/')\n",
    "#export_reviews(df_median_100_train_df, 'kcore5_median_100_train', traintest_seed, output_dir = '../BA-llawrenz/repo/data/experiments/amazon_experiments/')\n",
    "#export_reviews(   df_tail_100_train_df, 'kcore5_tail_100_train', traintest_seed, output_dir = '../BA-llawrenz/repo/data/experiments/amazon_experiments/')\n",
    "#export_reviews(   df_mix_100_train_df, 'kcore5_mix_100_train', traintest_seed, output_dir = '../BA-llawrenz/repo/data/experiments/amazon_experiments/')\n"
   ]
  },
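  {
   "cell_type": "markdown",
   "id": "e1a5c4f6",
   "metadata": {},
   "source": [
    "To illustrate the layout of the exported files, the cell below writes a toy sample to a temporary directory and reads it back. It is a sketch under assumptions and does not touch the real output directories: the three toy entries, the file name `toy_ratings.csv`, and all variable names are made up for this example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f2b6d5a7",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import tempfile\n",
    "import pandas as pd\n",
    "\n",
    "# toy entries (hypothetical) with the columns used in this notebook\n",
    "toy = pd.DataFrame({'reviewerID': ['u1', 'u1', 'u2'],\n",
    "                    'asin': ['i1', 'i2', 'i1'],\n",
    "                    'overall': [5.0, 3.0, 4.0],\n",
    "                    'unixReviewTime': [1393200000, 1354752000, 1391040000],\n",
    "                    'reviewText': ['Great.', 'Okay.', 'Good.']})\n",
    "\n",
    "with tempfile.TemporaryDirectory() as tmp_dir:\n",
    "    # ratings in (userId, itemId, rating, timestamp) format\n",
    "    ratings = toy.rename(columns = {'reviewerID': 'userId', 'asin': 'itemId',\n",
    "                                    'overall': 'rating', 'unixReviewTime': 'timestamp'})\n",
    "    ratings = ratings.drop(columns = ['reviewText'])\n",
    "    ratings_path = os.path.join(tmp_dir, 'toy_ratings.csv')\n",
    "    ratings.to_csv(ratings_path, index = False)\n",
    "    with open(ratings_path) as f:\n",
    "        header = f.readline().strip()\n",
    "\n",
    "    # one text file per user, reviews joined by blanks\n",
    "    for user, text in toy.groupby('reviewerID')['reviewText'].agg(' '.join).items():\n",
    "        with open(os.path.join(tmp_dir, user + '.txt'), mode = 'w') as f:\n",
    "            f.write(text)\n",
    "    with open(os.path.join(tmp_dir, 'u1.txt')) as f:\n",
    "        u1_text = f.read()\n",
    "\n",
    "print(header)\n",
    "print(u1_text)"
   ]
  },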
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8ac406ff",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
